Cars4U Project

Definition of the problem

There is a huge demand for used cars in the Indian market today. As sales of new cars have slowed in recent years, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding tech start-up that aims to gain a foothold in this market.

In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. The slowdown in new car sales could mean that demand is shifting towards the pre-owned market; in fact, some car owners replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers; dealership-level discounts come into play only in the last stage of the customer journey), used cars are very different beasts, with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

Project Objective

  1. Explore and visualize the dataset.
  2. Build a linear regression model to predict the prices of used cars.
  3. Generate a set of insights and recommendations that will help the business.

Assumptions

The used car data is a simple random sample from the population data.

About the data

used_cars_data.csv - contains information about used cars.

  1. S.No. : Serial Number
  2. Name : Name of the car which includes Brand name and Model name
  3. Location : The city in which the car is being sold or is available for purchase
  4. Year : Manufacturing year of the car
  5. Kilometers_driven : The total kilometers driven by the previous owner(s), in km
  6. Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
  7. Transmission : The type of transmission used by the car. (Automatic / Manual)
  8. Owner : Type of ownership
  9. Mileage : The standard mileage offered by the car company in kmpl or km/kg
  10. Engine : The displacement volume of the engine in CC.
  11. Power : The maximum power of the engine in bhp.
  12. Seats : The number of seats in the car.
  13. New_Price : The price of a new car of the same model, in INR Lakhs (1 Lakh = 100,000)
  14. Price : The price of the used car, in INR Lakhs (1 Lakh = 100,000)

Exploratory Data Analysis

Import the Python libraries

Read the data into the notebook

View the first and last 5 rows of the dataset

Understand the shape of the dataset.

Check the data types of the columns for the dataset.
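As a sketch of these first steps, the snippet below reads and inspects the data. A two-row inline sample stands in for used_cars_data.csv so the example is self-contained; the sample values are illustrative, not taken from the real file.

```python
import io
import pandas as pd

# In the notebook this would be: df = pd.read_csv("used_cars_data.csv")
# Here we read an inline two-row sample so the snippet is self-contained.
sample_csv = """S.No.,Name,Location,Year,Kilometers_Driven,Fuel_Type,Transmission,Owner_Type,Mileage,Engine,Power,Seats,New_Price,Price
1,Maruti Wagon R LXI CNG,Mumbai,2010,72000,CNG,Manual,First,26.6 km/kg,998 CC,58.16 bhp,5.0,,1.75
2,Hyundai Creta 1.6 CRDi SX,Pune,2015,41000,Diesel,Manual,First,19.67 kmpl,1582 CC,126.2 bhp,5.0,,12.5
"""
df = pd.read_csv(io.StringIO(sample_csv))

print(df.head())    # first rows
print(df.tail())    # last rows
print(df.shape)     # (rows, columns)
print(df.dtypes)    # data type of each column
```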

Fixing the data types

Converting "object" columns to "category" reduces the memory required to store the dataframe. It also helps in analysis.
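A minimal sketch of this conversion on a small synthetic dataframe (the columns and values are illustrative):

```python
import pandas as pd

# Repetitive string columns are where "category" saves the most memory
df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "CNG"] * 500,
    "Transmission": ["Manual", "Automatic"] * 1000,
})

before = df.memory_usage(deep=True).sum()
for col in ["Fuel_Type", "Transmission"]:
    df[col] = df[col].astype("category")
after = df.memory_usage(deep=True).sum()

print(f"memory before: {before} bytes, after: {after} bytes")
```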

Observations

Five point summary of continuous variables

Summary of categorical variables

Dropping unnecessary variables

We will drop the following variables/columns:

Check for missing values

Null values (out of 7253 rows)

Cleanup variables

The following variables are identified as 'object' because they contain characters (units such as kmpl, CC, and bhp) mixed in with the numeric values. These can be cleaned up so that Python recognizes them as numeric.
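One way this cleanup might look, assuming the unit appears as a trailing token after the number (the sample rows below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Mileage": ["26.6 km/kg", "19.67 kmpl", None],
    "Engine": ["998 CC", "1582 CC", "1199 CC"],
    "Power": ["58.16 bhp", "126.2 bhp", "null bhp"],  # some rows contain 'null bhp'
})

for col in ["Mileage", "Engine", "Power"]:
    # Keep only the leading numeric token, then coerce to float;
    # non-numeric leftovers (e.g. 'null') become NaN.
    df[col] = pd.to_numeric(df[col].str.split().str[0], errors="coerce")

print(df.dtypes)
```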

Handle Nulls again

Fix missing values by replacing them with the median

Finally, let's convert the Seats variable into a categorical type, since it is a discrete numerical variable.

EDA

Univariate Analysis on Numerical Variables

Price

Observation

Kilometers_Driven

Observation

Mileage

Observation

Engine

Observation

Observation

Distribution of each numerical variable

Observation

Outlier Analysis in every numerical column

Fix missing values by replacing them with the median

Univariate Analysis on Categorical Variables

Group Name into 'Make' value

Since the Name variable has 2041 unique values, we will try to group these by the 'Make' of the vehicle. We can split the Name and capture the first word, which appears to be the vehicle 'Make'.
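The step above can be sketched as follows (the sample names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Maruti Wagon R LXI CNG", "Hyundai Creta 1.6 CRDi SX", "Honda Jazz V"],
})

# The first token of Name is the make; upper-case it for consistency
df["NameMake"] = df["Name"].str.split().str[0].str.upper()

print(df["NameMake"].tolist())  # ['MARUTI', 'HYUNDAI', 'HONDA']
```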

NameMake

0   Name          7253 non-null  category
1   Location      7253 non-null  category
2   Year          7253 non-null  category
4   Fuel_Type     7253 non-null  category
5   Transmission  7253 non-null  category
6   Owner_Type    7253 non-null  category
10  Seats         7200 non-null  category

Bivariate Analysis

Analyze Correlations
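A minimal way to compute these correlations with pandas; the four-row sample below is synthetic, used only to show the mechanics:

```python
import pandas as pd

df = pd.DataFrame({
    "Price": [1.75, 12.5, 4.5, 17.74],
    "Engine": [998.0, 1582.0, 1199.0, 1968.0],
    "Power": [58.16, 126.2, 88.7, 140.8],
    "Kilometers_Driven": [72000, 41000, 46000, 40670],
})

# Pairwise Pearson correlations; focus on the column for Price
corr = df.corr()
print(corr["Price"].sort_values(ascending=False))
```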

Observations

Variables that are highly correlated with Price

Price vs Engine vs NameMake

Observations

Price vs Engine vs Location

Observations

Price vs Engine vs Year

Observations

Price vs Engine vs Fuel_Type

Observations

Price vs Engine vs Transmission

Observations

Price vs Engine vs Owner_Type

Observations

Price vs Engine vs Seats

Observations

Price vs Power vs NameMake

Observations

Price vs Power vs Location

Observations

Price vs Power vs Year

Observations

Price vs Power vs Fuel_Type

Observations

Price vs Power vs Transmission

Observations

Price vs Power vs Owner_Type

Observations

Price vs Power vs Seats

Observations

Price vs Mileage vs NameMake

Observations

Price vs Mileage vs Location

Observations

Price vs Mileage vs Year

Observations

Price vs Mileage vs Fuel_Type

Observations

Price vs Mileage vs Transmission

Observations

Price vs Mileage vs Owner_Type

Observations

Price vs Mileage vs Seats

Observations

Outliers Treatment

Model Building

Create Dummy Variables
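A sketch using pandas get_dummies on illustrative rows; drop_first=True removes one level per variable to avoid the dummy-variable trap (perfect multicollinearity):

```python
import pandas as pd

df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Transmission": ["Manual", "Automatic", "Manual"],
    "Price": [4.5, 12.5, 1.75],
})

# One-hot encode the categorical columns, dropping the first level of each
df_dummies = pd.get_dummies(df, columns=["Fuel_Type", "Transmission"], drop_first=True)
print(df_dummies.columns.tolist())
```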

Split the data into train and test
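A sketch of a 70/30 split with scikit-learn; the array contents here are placeholders for the real feature matrix and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # placeholder features
y = np.arange(10)                 # placeholder target (Price)

# 70/30 split; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(X_train.shape, X_test.shape)
```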

Choose Model, Train and Evaluate in order to create the Model
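A minimal sketch of fitting and scoring a linear regression with scikit-learn, on a tiny synthetic power-vs-price sample (not the project data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic stand-in: price grows roughly linearly with power
X = np.array([[58.0], [88.0], [126.0], [140.0]])
y = np.array([2.0, 5.0, 12.0, 15.0])

model = LinearRegression()
model.fit(X, y)
print("coef:", model.coef_, "intercept:", model.intercept_)
print("R^2 on training data:", model.score(X, y))
```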

Evaluate Model Performances

Sum of Squares Regression

The sum of the squared differences between the predicted values and the mean of the dependent variable (Price). This defines how well the regression line fits our data. If SSR (Sum of Squares Regression) is equal to SST (Sum of Squares Total), it means that the regression model captures all of the observed variability.

The mean absolute error (MAE) calculates the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals.

The root mean square error (RMSE) is similar to the MAE, but squares each residual instead of taking its absolute value, averages the squares, and then takes the square root.
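Both metrics can be computed as follows; the true and predicted values here are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([2.0, 5.0, 12.0, 15.0])
y_pred = np.array([2.5, 4.0, 13.0, 14.5])

mae = mean_absolute_error(y_true, y_pred)            # average |residual|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # sqrt of average residual^2
print(f"MAE: {mae}, RMSE: {rmse}")
```

Because RMSE squares the residuals before averaging, it penalizes large errors more heavily than MAE, so RMSE is always greater than or equal to MAE.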

Sum of Squares Error (SSE) Observation

We can choose to use the Mean Absolute Error, since it gives us the lower error margin.

R2 (coefficient of determination) regression score function.

R2 Observation

Conclusion

Model Statistics

Ordinary Least Squares (OLS)

Observation

Interpreting the Regression Results:

  1. Adjusted R-squared: It reflects the fit of the model.
    • R-squared values range from 0 to 1, where a higher value generally indicates a better fit, assuming certain conditions are met.
    • In our case, the value for Adj. R-squared is 0.69, which is good.
  2. const coefficient is the Y-intercept.

    • It means that if all the independent variables' (features such as Engine, Power, Mileage, and so on) coefficients are zero, then the expected output (i.e., the Y) would be equal to the const coefficient.
    • In our case, the value for the const coeff is 12.47
  3. A feature's coefficient (e.g., Engine): It represents the change in the output Y due to a change of one unit in that feature (everything else held constant).

  4. std err: It reflects the level of accuracy of the coefficients.
    • The lower it is, the higher is the level of accuracy.
  5. P >|t| : It is the p-value.

    • Pr(>|t|) : For each independent feature there is a null hypothesis and an alternate hypothesis:

      Ho : The independent feature is not significant

      Ha : The independent feature is significant

Pr(>|t|) gives the p-value for each independent feature, used to test that null hypothesis. We consider 0.05 (5%) as the significance level.

  6. Confidence Interval: It represents the range in which our coefficients are likely to fall (with a likelihood of 95%).

Linear Regression Assumptions

TEST FOR MULTICOLLINEARITY

Removing Multicollinearity

* Earlier R-squared was 0.69, now it is reduced to 0.68

                           coef        std err    t       P>|t|   [0.025     0.975]
Mileage                    0.0199      0.013      1.518   0.129   -0.006     0.046
Fuel_Type_LPG             -0.4881      0.806     -0.606   0.545   -2.068     1.092
Fuel_Type_Petrol          -0.1526      0.372     -0.411   0.681   -0.881     0.576
Owner_Type_Fourth & Above -0.6256      0.799     -0.783   0.434   -2.193     0.942
NameMake_FORCE             0.2047      1.640      0.125   0.901   -3.010     3.419
NameMake_ISUZU            -0.9454      1.165     -0.811   0.417   -3.229     1.339
NameMake_LAMBORGHINI       4.0395      2.445      1.652   0.099   -0.754     8.833
NameMake_MITSUBISHI       -0.4303      0.541     -0.795   0.427   -1.492     0.631
NameMake_OPELCORSA         0.9564      2.306      0.415   0.678   -3.564     5.477
NameMake_SMART            -4.108e-16   4.05e-16  -1.016   0.310   -1.2e-15   3.82e-16

                     coef      std err   t       P>|t|   [0.025   0.975]
NameMake_DATSUN     -1.2946    0.829    -1.561   0.119   -2.920   0.331
NameMake_FORD       -0.6055    0.479    -1.265   0.206   -1.544   0.333
NameMake_HONDA      -0.7279    0.465    -1.564   0.118   -1.640   0.185
NameMake_HYUNDAI    -0.4112    0.463    -0.888   0.375   -1.319   0.497
NameMake_MARUTI     -0.4544    0.465    -0.978   0.328   -1.365   0.457
NameMake_NISSAN     -0.7034    0.527    -1.334   0.182   -1.737   0.331
NameMake_RENAULT    -0.5078    0.509    -0.998   0.318   -1.505   0.489
NameMake_SKODA      -0.4235    0.494    -0.858   0.391   -1.391   0.544
NameMake_VOLKSWAGEN -0.7128    0.478    -1.490   0.136   -1.651   0.225

Now no feature has a p-value greater than 0.05, so we'll consider the features in X_train4_19 as the final ones and olsres7 as the final model.

Observations

Now the Adjusted R-squared is 0.680; our model is able to explain 68% of the variance, which shows the model is good. The Adjusted R-squared in olsres0 (where we considered all the variables) was 68%, which shows that the variables we dropped were not affecting the model much.

Mean of residuals should be 0
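A quick check on synthetic data: when the model includes an intercept, OLS residuals average to (numerically) zero by construction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, 100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# With an intercept, the residuals sum to zero up to floating-point error
print("mean of residuals:", residuals.mean())
```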

TEST FOR LINEARITY

TEST FOR NORMALITY
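One way to test normality of the residuals is the Shapiro-Wilk test (a Q-Q plot is a common visual alternative); sketched here on a synthetic normal sample standing in for the model residuals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(0, 1, 200)  # stand-in for the model's residuals

# Shapiro-Wilk: H0 = the residuals are normally distributed;
# a p-value above 0.05 means we fail to reject normality
stat, p_value = stats.shapiro(residuals)
print(f"statistic={stat:.4f}, p-value={p_value:.4f}")
```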